Chronic Kidney Disease Analysis
In this EDA project, we'll dive into the world of chronic kidney disease (CKD) using a dataset from Kaggle. Analyzing health data can seem a bit intimidating, but it's a critical task that helps us understand complex medical conditions.
For doctors and healthcare professionals, diagnosing a chronic disease isn't always straightforward.
It requires a careful look at many different factors, from a patient's lab results to their overall health indicators.
This process is made even more challenging when trying to identify patterns that predict the onset of a disease.
This is where we, as data analysts, can help! Through this Exploratory Data Analysis (EDA) project, we will clean, analyze, and visualize the chronic kidney disease dataset.
Our goal is to uncover valuable insights and correlations that could assist in identifying the key factors associated with the disease. By exploring this data, we hope to make the task of early detection and risk assessment a little less daunting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Kidney_disease.csv')
df
| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 395 | 55.0 | 80.0 | 1.020 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 47 | 6700 | 4.9 | no | no | no | good | no | no | notckd |
| 396 | 396 | 42.0 | 70.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 54 | 7800 | 6.2 | no | no | no | good | no | no | notckd |
| 397 | 397 | 12.0 | 80.0 | 1.020 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 49 | 6600 | 5.4 | no | no | no | good | no | no | notckd |
| 398 | 398 | 17.0 | 60.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 51 | 7200 | 5.9 | no | no | no | good | no | no | notckd |
| 399 | 399 | 58.0 | 80.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 53 | 6800 | 6.1 | no | no | no | good | no | no | notckd |
400 rows × 26 columns
df.head()
| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |
5 rows × 26 columns
df.tail()
| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 395 | 395 | 55.0 | 80.0 | 1.020 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 47 | 6700 | 4.9 | no | no | no | good | no | no | notckd |
| 396 | 396 | 42.0 | 70.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 54 | 7800 | 6.2 | no | no | no | good | no | no | notckd |
| 397 | 397 | 12.0 | 80.0 | 1.020 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 49 | 6600 | 5.4 | no | no | no | good | no | no | notckd |
| 398 | 398 | 17.0 | 60.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 51 | 7200 | 5.9 | no | no | no | good | no | no | notckd |
| 399 | 399 | 58.0 | 80.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 53 | 6800 | 6.1 | no | no | no | good | no | no | notckd |
5 rows × 26 columns
df.sample(5)
| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 272 | 272 | 56.0 | 80.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 42 | 5600 | 5.5 | no | no | no | good | no | no | notckd |
| 263 | 263 | 45.0 | 80.0 | 1.020 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 45 | 8600 | 5.2 | no | no | no | good | no | no | notckd |
| 371 | 371 | 28.0 | 60.0 | 1.025 | 0.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 51 | 6500 | 5.0 | no | no | no | good | no | no | notckd |
| 226 | 226 | 64.0 | 100.0 | 1.015 | 4.0 | 2.0 | abnormal | abnormal | notpresent | present | ... | 26 | 7500 | 3.4 | yes | yes | no | good | yes | no | ckd |
| 220 | 220 | 36.0 | 80.0 | 1.010 | 0.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 36 | 8800 | NaN | no | no | no | good | no | no | ckd |
5 rows × 26 columns
df.shape
(400, 26)
df.dtypes
id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              400 non-null    int64
 1   age             391 non-null    float64
 2   bp              388 non-null    float64
 3   sg              353 non-null    float64
 4   al              354 non-null    float64
 5   su              351 non-null    float64
 6   rbc             248 non-null    object
 7   pc              335 non-null    object
 8   pcc             396 non-null    object
 9   ba              396 non-null    object
 10  bgr             356 non-null    float64
 11  bu              381 non-null    float64
 12  sc              383 non-null    float64
 13  sod             313 non-null    float64
 14  pot             312 non-null    float64
 15  hemo            348 non-null    float64
 16  pcv             330 non-null    object
 17  wc              295 non-null    object
 18  rc              270 non-null    object
 19  htn             398 non-null    object
 20  dm              398 non-null    object
 21  cad             398 non-null    object
 22  appet           399 non-null    object
 23  pe              399 non-null    object
 24  ane             399 non-null    object
 25  classification  400 non-null    object
dtypes: float64(11), int64(1), object(14)
memory usage: 81.4+ KB
df.describe()
| | id | age | bp | sg | al | su | bgr | bu | sc | sod | pot | hemo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 400.000000 | 391.000000 | 388.000000 | 353.000000 | 354.000000 | 351.000000 | 356.000000 | 381.000000 | 383.000000 | 313.000000 | 312.000000 | 348.000000 |
| mean | 199.500000 | 51.483376 | 76.469072 | 1.017408 | 1.016949 | 0.450142 | 148.036517 | 57.425722 | 3.072454 | 137.528754 | 4.627244 | 12.526437 |
| std | 115.614301 | 17.169714 | 13.683637 | 0.005717 | 1.352679 | 1.099191 | 79.281714 | 50.503006 | 5.741126 | 10.408752 | 3.193904 | 2.912587 |
| min | 0.000000 | 2.000000 | 50.000000 | 1.005000 | 0.000000 | 0.000000 | 22.000000 | 1.500000 | 0.400000 | 4.500000 | 2.500000 | 3.100000 |
| 25% | 99.750000 | 42.000000 | 70.000000 | 1.010000 | 0.000000 | 0.000000 | 99.000000 | 27.000000 | 0.900000 | 135.000000 | 3.800000 | 10.300000 |
| 50% | 199.500000 | 55.000000 | 80.000000 | 1.020000 | 0.000000 | 0.000000 | 121.000000 | 42.000000 | 1.300000 | 138.000000 | 4.400000 | 12.650000 |
| 75% | 299.250000 | 64.500000 | 80.000000 | 1.020000 | 2.000000 | 0.000000 | 163.000000 | 66.000000 | 2.800000 | 142.000000 | 4.900000 | 15.000000 |
| max | 399.000000 | 90.000000 | 180.000000 | 1.025000 | 5.000000 | 5.000000 | 490.000000 | 391.000000 | 76.000000 | 163.000000 | 47.000000 | 17.800000 |
df.isnull().sum()
id                  0
age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64
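Raw missing counts are easier to compare as percentages; for example, the 152 missing `rbc` values above amount to 38% of the 400 rows. The snippet below is a minimal sketch of that calculation on a small toy frame (the `toy` values are illustrative assumptions, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the CKD data (values are illustrative only).
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "rbc": [np.nan, np.nan, "normal", "normal"],
    "classification": ["ckd", "ckd", "ckd", "notckd"],
})

# isnull().mean() is the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

On the real frame the same expression would be `df.isnull().mean() * 100`.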
df.columns
Index(['id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
'appet', 'pe', 'ane', 'classification'],
dtype='object')
df['id']
0 0
1 1
2 2
3 3
4 4
...
395 395
396 396
397 397
398 398
399 399
Name: id, Length: 400, dtype: int64
df.drop('id', axis = 1, inplace = True)
df.columns
Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
'appet', 'pe', 'ane', 'classification'],
dtype='object')
df.columns = ['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'red_blood_cells', 'pus_cell',
'pus_cell_clumps', 'bacteria', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count',
'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema',
'aanemia', 'class']
df.columns
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
'potassium', 'haemoglobin', 'packed_cell_volume',
'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
'peda_edema', 'aanemia', 'class'],
dtype='object')
Converting packed_cell_volume column from object --> int/float
df['packed_cell_volume'] # --> Stored as object, but the values should be numeric
0 44
1 38
2 31
3 32
4 35
..
395 47
396 54
397 49
398 51
399 53
Name: packed_cell_volume, Length: 400, dtype: object
df['packed_cell_volume'].unique() # --> Due to the presence of '\t?' it turns out to be string/object
array(['44', '38', '31', '32', '35', '39', '36', '33', '29', '28', nan,
'16', '24', '37', '30', '34', '40', '45', '27', '48', '\t?', '52',
'14', '22', '18', '42', '17', '46', '23', '19', '25', '41', '26',
'15', '21', '43', '20', '\t43', '47', '9', '49', '50', '53', '51',
'54'], dtype=object)
df['packed_cell_volume'] = pd.to_numeric(df['packed_cell_volume'], errors = 'coerce')
# errors='coerce' converts values that cannot be parsed as numbers (such as '\t?') to NaN instead of raising an error
df['packed_cell_volume'].dtype
dtype('float64')
df['packed_cell_volume'].unique() # Unparseable strings such as '\t?' are now NaN; all other values are floats
array([44., 38., 31., 32., 35., 39., 36., 33., 29., 28., nan, 16., 24.,
37., 30., 34., 40., 45., 27., 48., 52., 14., 22., 18., 42., 17.,
46., 23., 19., 25., 41., 26., 15., 21., 43., 20., 47., 9., 49.,
50., 53., 51., 54.])
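To see how `errors='coerce'` behaves in isolation, here is a small sketch on a hand-made Series containing the same kinds of entries found above (`'\t43'`, `'\t?'`). The explicit `str.strip()` call and the `newly_missing` check are additions for illustration and are not part of the original workflow.

```python
import pandas as pd

# Hand-made sample mimicking the messy entries seen in packed_cell_volume.
raw = pd.Series(["44", "\t43", "\t?", None])

# Strip stray whitespace first, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.strip(), errors="coerce")

# Entries that were present in the raw data but could not be parsed.
newly_missing = cleaned.isna() & raw.notna()

print(cleaned)
print("values lost to coercion:", int(newly_missing.sum()))
```

Counting `newly_missing` is a cheap way to confirm that coercion only discarded placeholders such as `'?'` and not real measurements.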
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   age                      391 non-null    float64
 1   blood_pressure           388 non-null    float64
 2   specific_gravity         353 non-null    float64
 3   albumin                  354 non-null    float64
 4   sugar                    351 non-null    float64
 5   red_blood_cells          248 non-null    object
 6   pus_cell                 335 non-null    object
 7   pus_cell_clumps          396 non-null    object
 8   bacteria                 396 non-null    object
 9   blood_glucose_random     356 non-null    float64
 10  blood_urea               381 non-null    float64
 11  serum_creatinine         383 non-null    float64
 12  sodium                   313 non-null    float64
 13  potassium                312 non-null    float64
 14  haemoglobin              348 non-null    float64
 15  packed_cell_volume       329 non-null    float64
 16  white_blood_cell_count   295 non-null    object
 17  red_blood_cell_count     270 non-null    object
 18  hypertension             398 non-null    object
 19  diabetes_mellitus        398 non-null    object
 20  coronary_artery_disease  398 non-null    object
 21  appetite                 399 non-null    object
 22  peda_edema               399 non-null    object
 23  aanemia                  399 non-null    object
 24  class                    400 non-null    object
dtypes: float64(12), object(13)
memory usage: 78.3+ KB
df['white_blood_cell_count'].unique()
array(['7800', '6000', '7500', '6700', '7300', nan, '6900', '9600',
'12100', '4500', '12200', '11000', '3800', '11400', '5300', '9200',
'6200', '8300', '8400', '10300', '9800', '9100', '7900', '6400',
'8600', '18900', '21600', '4300', '8500', '11300', '7200', '7700',
'14600', '6300', '\t6200', '7100', '11800', '9400', '5500', '5800',
'13200', '12500', '5600', '7000', '11900', '10400', '10700',
'12700', '6800', '6500', '13600', '10200', '9000', '14900', '8200',
'15200', '5000', '16300', '12400', '\t8400', '10500', '4200',
'4700', '10900', '8100', '9500', '2200', '12800', '11200', '19100',
'\t?', '12300', '16700', '2600', '26400', '8800', '7400', '4900',
'8000', '12000', '15700', '4100', '5700', '11500', '5400', '10800',
'9900', '5200', '5900', '9300', '9700', '5100', '6600'],
dtype=object)
df['white_blood_cell_count'] = pd.to_numeric(df['white_blood_cell_count'], errors = 'coerce')
df['white_blood_cell_count'].dtype
dtype('float64')
df['red_blood_cell_count'] = pd.to_numeric(df['red_blood_cell_count'], errors = 'coerce')
df['red_blood_cell_count'].dtype
dtype('float64')
Categorical = [col for col in df.columns if df[col].dtype == 'object']
Categorical
['red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria', 'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema', 'aanemia', 'class']
Numerical = [col for col in df.columns if df[col].dtype != 'object']
Numerical
['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium', 'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count']
for col in Categorical:
    print(f' {col} : \n {df[col].unique()}')
 red_blood_cells :
 [nan 'normal' 'abnormal']
 pus_cell :
 ['normal' 'abnormal' nan]
 pus_cell_clumps :
 ['notpresent' 'present' nan]
 bacteria :
 ['notpresent' 'present' nan]
 hypertension :
 ['yes' 'no' nan]
 diabetes_mellitus :
 ['yes' 'no' ' yes' '\tno' '\tyes' nan]
 coronary_artery_disease :
 ['no' 'yes' '\tno' nan]
 appetite :
 ['good' 'poor' nan]
 peda_edema :
 ['no' 'yes' nan]
 aanemia :
 ['no' 'yes' nan]
 class :
 ['ckd' 'ckd\t' 'notckd']
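# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to df
df['diabetes_mellitus'] = df['diabetes_mellitus'].replace(to_replace = {' yes' : 'yes', '\tno' : 'no', '\tyes' : 'yes'})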
df['diabetes_mellitus'].unique()
array(['yes', 'no', nan], dtype=object)
df['coronary_artery_disease'] = df['coronary_artery_disease'].replace(to_replace = {'\tno':'no'})
df['coronary_artery_disease'].unique()
array(['no', 'yes', nan], dtype=object)
df['class'] = df['class'].replace(to_replace = {'ckd\t':'ckd'})
df['class'].unique()
array(['ckd', 'notckd'], dtype=object)
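All of the stray labels above differ from the valid ones only by leading or trailing whitespace, so a generic alternative to listing each bad value is to strip whitespace from every categorical column. The following is a sketch on a toy frame; the `toy` data and the `label_cols` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the kinds of stray labels seen above.
toy = pd.DataFrame({
    "diabetes_mellitus": ["yes", " yes", "\tno", "\tyes", np.nan],
    "class": ["ckd", "ckd\t", "notckd", "ckd", "notckd"],
})

# Columns holding text labels (listed explicitly for this sketch).
label_cols = ["diabetes_mellitus", "class"]

for col in label_cols:
    # .str.strip() removes spaces and tabs on both ends; NaN stays NaN.
    toy[col] = toy[col].str.strip()

print(toy["diabetes_mellitus"].unique())
print(toy["class"].unique())
```

This only helps when whitespace is the sole problem; a misspelled label would still need an explicit replacement.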
df['class'] = df['class'].map({'ckd': 1, 'notckd': 0})
Univariate analysis
plt.figure(figsize = (10,6))
sns.histplot(df['age'].dropna(), kde = True, bins = 20)
plt.title("Distribution of age")
plt.xlabel('Age')
plt.show()
Insights:
The histogram and the corresponding KDE curve show that the distribution of age is left-skewed (or negatively skewed). This indicates that the majority of individuals in this dataset are concentrated in the older age brackets, with a smaller number of younger individuals.
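The visual impression of left skew can be checked numerically: a left-skewed sample has a negative skewness coefficient and a mean below its median, which matches the `describe()` output above (mean age about 51.5, median 55). Below is a minimal sketch using `Series.skew()` on a made-up set of ages; the numbers are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up ages with a long tail toward younger values.
ages = pd.Series([5, 20, 35, 45, 50, 55, 58, 60, 62, 65, 68, 70])

skewness = ages.skew()          # negative => left (negative) skew
mean_age = ages.mean()
median_age = ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```

On the real data the equivalent call would be `df['age'].skew()`.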
df.columns
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
'potassium', 'haemoglobin', 'packed_cell_volume',
'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
'peda_edema', 'aanemia', 'class'],
dtype='object')
sns.countplot(x = 'hypertension', data = df, palette = 'Set2')
<Axes: xlabel='hypertension', ylabel='count'>
Insights:
The chart shows that individuals without hypertension ('no') clearly outnumber those with hypertension ('yes').
Count comparison
'No' hypertension: approximately 250 individuals.
'Yes' hypertension: approximately 150 individuals.
Hypertension is therefore not universal in this dataset: most individuals do not have it, although a sizeable minority does.
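Counts read off a bar chart are only approximate. A quick way to get exact figures is `value_counts()`, sketched here on a small made-up Series (the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up hypertension labels, including two missing entries.
hypertension = pd.Series(["no"] * 5 + ["yes"] * 3 + [np.nan] * 2)

# Absolute counts, keeping missing values visible.
counts = hypertension.value_counts(dropna=False)

# Shares among the non-missing entries only.
shares = hypertension.value_counts(normalize=True)

print(counts)
print(shares)
```

The same two calls on `df['hypertension']` would give the exact numbers behind the chart.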
plt.figure(figsize = (10,8))
sns.boxplot(x = 'class', y = 'blood_urea', data = df, palette = 'viridis')
plt.title('Boxplot')
Text(0.5, 1.0, 'Boxplot')
Insights:
Blood urea levels differ markedly between the two classes. The distribution for class 1 (CKD) sits higher and is more spread out than for class 0, which points to a positive association between elevated blood urea and class 1.
Class 1 also has numerous outliers with extremely high blood urea, some reaching nearly 400. This suggests that a few individuals in this class have extremely elevated blood urea, whereas class 0 has a much tighter distribution.
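A boxplot draws points as outliers when they fall more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that same rule per class to count such points; the `toy` values and the helper `iqr_outliers` are hypothetical and only illustrate the rule.

```python
import pandas as pd

# Made-up blood urea values for the two classes.
toy = pd.DataFrame({
    "class": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "blood_urea": [40, 55, 60, 70, 90, 390, 20, 25, 30, 35],
})

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Flag values beyond 1.5 * IQR from the first or third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Count flagged points separately for each class.
outlier_counts = {}
for label, group in toy.groupby("class"):
    outlier_counts[int(label)] = int(iqr_outliers(group["blood_urea"]).sum())

print(outlier_counts)
```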
sns.violinplot(x = 'class', y = 'serum_creatinine', data = df, palette = 'muted')
<Axes: xlabel='class', ylabel='serum_creatinine'>
sns.countplot(x = 'aanemia', data = df,palette = 'pastel')
<Axes: xlabel='aanemia', ylabel='count'>
Insights:
The chart shows that the number of individuals without anaemia ('no') is much higher than the number with anaemia ('yes').
Count comparison
'No' anaemia: approximately 350 individuals.
'Yes' anaemia: slightly more than 50 individuals.
This suggests that anaemia is not a universal condition and that the majority of the population does not have it.
df['appetite'].unique()
array(['good', 'poor', nan], dtype=object)
x = df['appetite'].value_counts()
x
appetite
good    317
poor     82
Name: count, dtype: int64
plt.figure(figsize = (8,8))
plt.pie(x, labels = x.index, autopct = '%.1f%%', colors = ['lightpink','lightcoral'],explode = (0,0.1), shadow = True, startangle=90)
plt.title('Pie chart for appetite')
plt.show()
x.plot.pie(autopct = '%1.1f%%', colors = ['lightpink','lightcoral'],explode = (0,0.1), shadow = True, startangle=90)
<Axes: ylabel='count'>
Insights:
The share of individuals with a 'good' appetite (79.4%) is much higher than the share with a 'poor' appetite (20.6%).
df['pus_cell_clumps']
0 notpresent
1 notpresent
2 notpresent
3 present
4 notpresent
...
395 notpresent
396 notpresent
397 notpresent
398 notpresent
399 notpresent
Name: pus_cell_clumps, Length: 400, dtype: object
sns.countplot(x = df['pus_cell_clumps'], palette = 'Set1')
<Axes: xlabel='pus_cell_clumps', ylabel='count'>
Insights:
The number of individuals without pus cell clumps ('notpresent') is much higher than the number with pus cell clumps ('present').
Count comparison
'notpresent' pus_cell_clumps: approximately 350 individuals.
'present' pus_cell_clumps: nearly 50 individuals.
This indicates that pus cell clumps are not a universal condition and that the majority of the population does not have them.
df['white_blood_cell_count']
0 7800.0
1 6000.0
2 7500.0
3 6700.0
4 7300.0
...
395 6700.0
396 7800.0
397 6600.0
398 7200.0
399 6800.0
Name: white_blood_cell_count, Length: 400, dtype: float64
sns.histplot(df['white_blood_cell_count'].dropna(), bins = 20, kde = True, color = 'darkred')
<Axes: xlabel='white_blood_cell_count', ylabel='Count'>
Insights:
The plot shows that the distribution of white blood cell count is right-skewed (or positively skewed). This means that the majority of the data points are concentrated on the lower end of the count, with a long tail extending to the right, representing a smaller number of individuals with very high counts.
# Donut chart (also called a ring chart)
df['diabetes_mellitus'].value_counts().plot.pie(autopct = '%1.1f%%', wedgeprops = dict(width = 0.5))
<Axes: ylabel='count'>
Insights:
The majority of the population, at 65.6%, does not have diabetes mellitus. The remaining 34.4% of the population does.
sns.countplot(x = 'coronary_artery_disease', data = df, palette = 'Set2')
<Axes: xlabel='coronary_artery_disease', ylabel='count'>
Insights:
The number of individuals without coronary artery disease ('no') is much higher than the number with coronary artery disease ('yes').
Count comparison
'no' coronary_artery_disease: approximately 360 individuals.
'yes' coronary_artery_disease: nearly 30 individuals.
This indicates that coronary artery disease is not a universal condition and that the majority of the population does not have it.
df.columns
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
'potassium', 'haemoglobin', 'packed_cell_volume',
'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
'peda_edema', 'aanemia', 'class'],
dtype='object')
sns.countplot(x = 'peda_edema', data = df, palette = 'pastel')
<Axes: xlabel='peda_edema', ylabel='count'>
Insights:
The number of individuals without pedal edema ('no') is much higher than the number with pedal edema ('yes').
Count comparison
'no' peda_edema: approximately 325 individuals.
'yes' peda_edema: nearly 75 individuals.
This indicates that pedal edema is not a universal condition and that the majority of the population does not have it.
sns.countplot(x = 'bacteria', data = df, palette = 'muted')
<Axes: xlabel='bacteria', ylabel='count'>
Bivariate analysis
sns.scatterplot(x = 'age', y ='blood_pressure', data = df)
<Axes: xlabel='age', ylabel='blood_pressure'>
Insights:
The plot shows a weak positive relationship between age and blood pressure: blood pressure tends to rise slightly with age (the Pearson correlation computed later in this notebook is about 0.16). The data points form a roughly triangular or "cone" shape, with the spread of blood pressure values becoming wider as age increases.
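A scatter plot gives the direction of a relationship, but its strength is easier to judge from a correlation coefficient. Here is a minimal sketch of computing the Pearson correlation between two columns with `Series.corr()`; the `toy` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up age / blood pressure pairs.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70],
    "blood_pressure": [70, 80, 70, 90, 80, 100],
})

# Series.corr defaults to the Pearson coefficient.
r = toy["age"].corr(toy["blood_pressure"])

# Cross-check against NumPy's correlation matrix.
r_numpy = np.corrcoef(toy["age"], toy["blood_pressure"])[0, 1]

print(f"Pearson r = {r:.3f}")
```

On the real data, `df['age'].corr(df['blood_pressure'])` skips rows where either value is missing.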
sns.scatterplot(x = 'age', y ='blood_pressure',hue = 'class', data = df, palette = 'coolwarm')
<Axes: xlabel='age', ylabel='blood_pressure'>
sns.boxplot(x = 'diabetes_mellitus' ,y = 'albumin' ,data = df, palette='muted')
<Axes: xlabel='diabetes_mellitus', ylabel='albumin'>
sns.violinplot(x = 'diabetes_mellitus' ,y = 'albumin' ,data = df, palette='muted', inner = 'quartile')
<Axes: xlabel='diabetes_mellitus', ylabel='albumin'>
# Stacked bar chart
pd.crosstab(df['diabetes_mellitus'], df['hypertension'])
| diabetes_mellitus | hypertension: no | hypertension: yes |
|---|---|---|
| no | 220 | 41 |
| yes | 31 | 106 |
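Because the two diabetes groups differ in size, row-wise proportions are easier to compare than raw counts. The sketch below rebuilds the crosstab from the four counts shown above and divides each row by its total; constructing the table by hand is just a shortcut for this example.

```python
import pandas as pd

# Counts copied from the crosstab output above.
counts = pd.DataFrame(
    {"no": [220, 31], "yes": [41, 106]},
    index=pd.Index(["no", "yes"], name="diabetes_mellitus"),
)
counts.columns.name = "hypertension"

# Divide every row by its own total to get within-row shares.
row_share = counts.div(counts.sum(axis=1), axis=0)

print(row_share.round(3))
```

With the full DataFrame, `pd.crosstab(df['diabetes_mellitus'], df['hypertension'], normalize='index')` should give the same proportions directly. About 77% of individuals with diabetes also have hypertension (106 of 137), compared with about 16% of those without diabetes (41 of 261).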
diabetes_hypertension = pd.crosstab(df['diabetes_mellitus'], df['hypertension'])
diabetes_hypertension.plot(kind = 'bar', stacked = True)
<Axes: xlabel='diabetes_mellitus'>
Multivariate analysis
cols = ['age', 'blood_pressure', 'blood_glucose_random', 'serum_creatinine', 'class']
df[cols]
| | age | blood_pressure | blood_glucose_random | serum_creatinine | class |
|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 121.0 | 1.2 | 1 |
| 1 | 7.0 | 50.0 | NaN | 0.8 | 1 |
| 2 | 62.0 | 80.0 | 423.0 | 1.8 | 1 |
| 3 | 48.0 | 70.0 | 117.0 | 3.8 | 1 |
| 4 | 51.0 | 80.0 | 106.0 | 1.4 | 1 |
| ... | ... | ... | ... | ... | ... |
| 395 | 55.0 | 80.0 | 140.0 | 0.5 | 0 |
| 396 | 42.0 | 70.0 | 75.0 | 1.2 | 0 |
| 397 | 12.0 | 80.0 | 100.0 | 0.6 | 0 |
| 398 | 17.0 | 60.0 | 114.0 | 1.0 | 0 |
| 399 | 58.0 | 80.0 | 131.0 | 1.1 | 0 |
400 rows × 5 columns
g = sns.PairGrid(df[cols], hue = 'class', palette = 'coolwarm')
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, cmap = 'Blues_d')
g.map_diag(sns.histplot)
g.add_legend()
# plt.title would only label the last subplot; suptitle titles the whole grid
g.figure.suptitle('PairGrid for selected columns', y = 1.02)
plt.show()
df.corr(numeric_only=True)
| | age | blood_pressure | specific_gravity | albumin | sugar | blood_glucose_random | blood_urea | serum_creatinine | sodium | potassium | haemoglobin | packed_cell_volume | white_blood_cell_count | red_blood_cell_count | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.159480 | -0.191096 | 0.122091 | 0.220866 | 0.244992 | 0.196985 | 0.132531 | -0.100046 | 0.058377 | -0.192928 | -0.242119 | 0.118339 | -0.268896 | 0.227268 |
| blood_pressure | 0.159480 | 1.000000 | -0.218836 | 0.160689 | 0.222576 | 0.160193 | 0.188517 | 0.146222 | -0.116422 | 0.075151 | -0.306540 | -0.326319 | 0.029753 | -0.261936 | 0.294077 |
| specific_gravity | -0.191096 | -0.218836 | 1.000000 | -0.469760 | -0.296234 | -0.374710 | -0.314295 | -0.361473 | 0.412190 | -0.072787 | 0.602582 | 0.603560 | -0.236215 | 0.579476 | -0.732163 |
| albumin | 0.122091 | 0.160689 | -0.469760 | 1.000000 | 0.269305 | 0.379464 | 0.453528 | 0.399198 | -0.459896 | 0.129038 | -0.634632 | -0.611891 | 0.231989 | -0.566437 | 0.627090 |
| sugar | 0.220866 | 0.222576 | -0.296234 | 0.269305 | 1.000000 | 0.717827 | 0.168583 | 0.223244 | -0.131776 | 0.219450 | -0.224775 | -0.239189 | 0.184893 | -0.237448 | 0.344070 |
| blood_glucose_random | 0.244992 | 0.160193 | -0.374710 | 0.379464 | 0.717827 | 1.000000 | 0.143322 | 0.114875 | -0.267848 | 0.066966 | -0.306189 | -0.301385 | 0.150015 | -0.281541 | 0.419672 |
| blood_urea | 0.196985 | 0.188517 | -0.314295 | 0.453528 | 0.168583 | 0.143322 | 1.000000 | 0.586368 | -0.323054 | 0.357049 | -0.610360 | -0.607621 | 0.050462 | -0.579087 | 0.380605 |
| serum_creatinine | 0.132531 | 0.146222 | -0.361473 | 0.399198 | 0.223244 | 0.114875 | 0.586368 | 1.000000 | -0.690158 | 0.326107 | -0.401670 | -0.404193 | -0.006390 | -0.400852 | 0.299969 |
| sodium | -0.100046 | -0.116422 | 0.412190 | -0.459896 | -0.131776 | -0.267848 | -0.323054 | -0.690158 | 1.000000 | 0.097887 | 0.365183 | 0.376914 | 0.007277 | 0.344873 | -0.375674 |
| potassium | 0.058377 | 0.075151 | -0.072787 | 0.129038 | 0.219450 | 0.066966 | 0.357049 | 0.326107 | 0.097887 | 1.000000 | -0.133746 | -0.163182 | -0.105576 | -0.158309 | 0.084541 |
| haemoglobin | -0.192928 | -0.306540 | 0.602582 | -0.634632 | -0.224775 | -0.306189 | -0.610360 | -0.401670 | 0.365183 | -0.133746 | 1.000000 | 0.895382 | -0.169413 | 0.798880 | -0.768919 |
| packed_cell_volume | -0.242119 | -0.326319 | 0.603560 | -0.611891 | -0.239189 | -0.301385 | -0.607621 | -0.404193 | 0.376914 | -0.163182 | 0.895382 | 1.000000 | -0.197022 | 0.791625 | -0.741427 |
| white_blood_cell_count | 0.118339 | 0.029753 | -0.236215 | 0.231989 | 0.184893 | 0.150015 | 0.050462 | -0.006390 | 0.007277 | -0.105576 | -0.169413 | -0.197022 | 1.000000 | -0.158163 | 0.231919 |
| red_blood_cell_count | -0.268896 | -0.261936 | 0.579476 | -0.566437 | -0.237448 | -0.281541 | -0.579087 | -0.400852 | 0.344873 | -0.158309 | 0.798880 | 0.791625 | -0.158163 | 1.000000 | -0.699089 |
| class | 0.227268 | 0.294077 | -0.732163 | 0.627090 | 0.344070 | 0.419672 | 0.380605 | 0.299969 | -0.375674 | 0.084541 | -0.768919 | -0.741427 | 0.231919 | -0.699089 | 1.000000 |
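A 15 by 15 matrix is hard to scan, so it can help to pull out the `class` row and rank features by the absolute size of their correlation. The sketch below does this for a handful of values copied from the table above; selecting only seven features is a simplification for the example.

```python
import pandas as pd

# Correlations with `class`, copied from the matrix above (subset only).
class_corr = pd.Series({
    "albumin": 0.627090,
    "blood_glucose_random": 0.419672,
    "potassium": 0.084541,
    "haemoglobin": -0.768919,
    "packed_cell_volume": -0.741427,
    "specific_gravity": -0.732163,
    "red_blood_cell_count": -0.699089,
})

# Order by absolute value while keeping the original signs.
order = class_corr.abs().sort_values(ascending=False).index
ranked = class_corr.reindex(order)

print(ranked)
```

On the full frame the starting Series would be `df.corr(numeric_only=True)['class'].drop('class')`. In this subset, haemoglobin, packed cell volume, and specific gravity show the strongest (negative) correlations with the CKD label.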
plt.figure(figsize= (10,8))
sns.heatmap(df.corr(numeric_only=True), cmap = 'coolwarm', annot= True)
<Axes: >
sns.swarmplot(x = 'diabetes_mellitus', y = 'age', hue = 'hypertension', data = df, palette = 'pastel', size=8)
<Axes: xlabel='diabetes_mellitus', ylabel='age'>
fig = px.scatter(df, x = 'age', y = 'blood_pressure', color = 'class', hover_data = ['serum_creatinine', 'haemoglobin'], title = 'Interactive scatter plot')
fig.show()
fig = px.scatter_3d(df, x = 'age', y = 'blood_pressure', z = 'serum_creatinine', color = 'class', title = '3D SCATTER PLOT')
fig.show()
fig = px.scatter_3d(df, x = 'age', y = 'blood_pressure', z = 'serum_creatinine', color = 'haemoglobin', title = '3D SCATTER PLOT')
fig.show()
import plotly.graph_objects as go
corr = df.corr(numeric_only=True)
fig = go.Figure(data = go.Heatmap(z = corr.values, x = corr.columns, y = corr.index))
fig.show()
df.isnull().sum()
age                          9
blood_pressure              12
specific_gravity            47
albumin                     46
sugar                       49
red_blood_cells            152
pus_cell                    65
pus_cell_clumps              4
bacteria                     4
blood_glucose_random        44
blood_urea                  19
serum_creatinine            17
sodium                      87
potassium                   88
haemoglobin                 52
packed_cell_volume          71
white_blood_cell_count     106
red_blood_cell_count       131
hypertension                 2
diabetes_mellitus            2
coronary_artery_disease      2
appetite                     1
peda_edema                   1
aanemia                      1
class                        0
dtype: int64
Categorical
['red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria', 'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema', 'aanemia', 'class']
Numerical
['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium', 'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count']
median = df[Numerical].median()
median
age                         55.00
blood_pressure              80.00
specific_gravity             1.02
albumin                      0.00
sugar                        0.00
blood_glucose_random       121.00
blood_urea                  42.00
serum_creatinine             1.30
sodium                     138.00
potassium                    4.40
haemoglobin                 12.65
packed_cell_volume          40.00
white_blood_cell_count    8000.00
red_blood_cell_count         4.80
dtype: float64
df[Numerical] = df[Numerical].fillna(median)
df[Numerical].isnull().sum()
age                       0
blood_pressure            0
specific_gravity          0
albumin                   0
sugar                     0
blood_glucose_random      0
blood_urea                0
serum_creatinine          0
sodium                    0
potassium                 0
haemoglobin               0
packed_cell_volume        0
white_blood_cell_count    0
red_blood_cell_count      0
dtype: int64
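# .iloc[0] (square brackets) selects the first row of modes, i.e. the most frequent value per column.
# Calling .iloc(0) with parentheses returns the indexer object itself rather than any data.
mode = df[Categorical].mode().iloc[0]
mode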
df[Categorical] = df[Categorical].fillna(mode)
df[Categorical].isna().sum()
red_blood_cells            0
pus_cell                   0
pus_cell_clumps            0
bacteria                   0
hypertension               0
diabetes_mellitus          0
coronary_artery_disease    0
appetite                   0
peda_edema                 0
aanemia                    0
class                      0
dtype: int64
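The imputation strategy used here (median for numeric columns, most frequent value for categorical ones) can be sketched on a tiny frame to make each step visible. The `toy` data and the column lists are assumptions for illustration; `DataFrame.mode()` returns a frame, so `.iloc[0]` is used to take its first row.

```python
import numpy as np
import pandas as pd

# Tiny frame with one missing value per column.
toy = pd.DataFrame({
    "sodium": [135.0, np.nan, 138.0, 142.0],
    "appetite": ["good", "good", np.nan, "poor"],
})
numeric_cols = ["sodium"]
categorical_cols = ["appetite"]

# One fill value per column: medians for numbers, modes for labels.
medians = toy[numeric_cols].median()
modes = toy[categorical_cols].mode().iloc[0]

# fillna with a Series matches fill values to columns by name.
toy[numeric_cols] = toy[numeric_cols].fillna(medians)
toy[categorical_cols] = toy[categorical_cols].fillna(modes)

print(toy)
```

Printing `medians` and `modes` before filling is a useful sanity check that each is a plain Series of values indexed by column name.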
df.dtypes
age                        float64
blood_pressure             float64
specific_gravity           float64
albumin                    float64
sugar                      float64
red_blood_cells             object
pus_cell                    object
pus_cell_clumps             object
bacteria                    object
blood_glucose_random       float64
blood_urea                 float64
serum_creatinine           float64
sodium                     float64
potassium                  float64
haemoglobin                float64
packed_cell_volume         float64
white_blood_cell_count     float64
red_blood_cell_count       float64
hypertension                object
diabetes_mellitus           object
coronary_artery_disease     object
appetite                    object
peda_edema                  object
aanemia                     object
class                        int64
dtype: object
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in Categorical:
    # Convert the column to string type before fitting the encoder
    df[col] = label_encoder.fit_transform(df[col].astype(str))
df.dtypes
age                        float64
blood_pressure             float64
specific_gravity           float64
albumin                    float64
sugar                      float64
red_blood_cells              int32
pus_cell                     int32
pus_cell_clumps              int32
bacteria                     int32
blood_glucose_random       float64
blood_urea                 float64
serum_creatinine           float64
sodium                     float64
potassium                  float64
haemoglobin                float64
packed_cell_volume         float64
white_blood_cell_count     float64
red_blood_cell_count       float64
hypertension                 int32
diabetes_mellitus            int32
coronary_artery_disease      int32
appetite                     int32
peda_edema                   int32
aanemia                      int32
class                        int32
dtype: object
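After encoding, it is worth checking which integer stands for which label, since a single reused encoder does not keep that mapping per column. As an alternative sketch (using pandas categoricals rather than `LabelEncoder`, so this is not the method used above), the codes and their labels can be inspected side by side. Both approaches assign codes in sorted label order; the `appetite` sample is made up.

```python
import pandas as pd

# Made-up appetite labels.
appetite = pd.Series(["good", "poor", "good", "poor"])

# Converting to a categorical sorts the labels and assigns integer codes.
as_category = appetite.astype("category")
codes = as_category.cat.codes

# Code -> label lookup, handy for decoding results later.
mapping = dict(enumerate(as_category.cat.categories))

print(list(codes))
print(mapping)
```

If an unexpected extra code shows up in a column, it usually means some rows held a value other than the intended labels before encoding.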
df[Categorical]
| | red_blood_cells | pus_cell | pus_cell_clumps | bacteria | hypertension | diabetes_mellitus | coronary_artery_disease | appetite | peda_edema | aanemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 |
| 3 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 |
| 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 396 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 397 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 398 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 399 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
400 rows × 11 columns
df.head()
| | age | blood_pressure | specific_gravity | albumin | sugar | red_blood_cells | pus_cell | pus_cell_clumps | bacteria | blood_glucose_random | ... | packed_cell_volume | white_blood_cell_count | red_blood_cell_count | hypertension | diabetes_mellitus | coronary_artery_disease | appetite | peda_edema | aanemia | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | 0 | 2 | 1 | 1 | 121.0 | ... | 44.0 | 7800.0 | 5.2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
| 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | 0 | 2 | 1 | 1 | 121.0 | ... | 38.0 | 6000.0 | 4.8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | 2 | 2 | 1 | 1 | 423.0 | ... | 31.0 | 7500.0 | 4.8 | 1 | 2 | 1 | 2 | 1 | 2 | 1 |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 2 | 1 | 2 | 1 | 117.0 | ... | 32.0 | 6700.0 | 3.9 | 2 | 1 | 1 | 2 | 2 | 2 | 1 |
| 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | 2 | 2 | 1 | 1 | 106.0 | ... | 35.0 | 7300.0 | 4.6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
5 rows × 25 columns